In this paper, we discuss an imitation-learning-based method for reducing the calibration error of a mixed reality system consisting of a vision sensor and a projector. Unlike a head-mounted display, in this setup the augmented information is presented to the human subject by projecting it onto the real-world scene. The camera and projector must therefore be calibrated as a stereo pair to project accurate information in 3D space. Previous calibration processes require multiple recording and parameter-tuning steps to achieve the desired calibration, which is usually a time-consuming process. To avoid such tedious calibration, we train a CNN model to iteratively correct the extrinsic offset given a QR code and a projected pattern. We describe the overall system setup, the data collection for training, and the results of the auto-correction model.
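As a rough illustration of the iterative correction loop described above, the sketch below uses a hypothetical `OffsetNet` CNN and a simple damped additive update of the 6-DoF extrinsic offset; the architecture, update rule, and the `capture` stand-in are all assumptions, not the paper's exact design.

```python
# Minimal sketch of iterative extrinsic auto-correction (hypothetical names).
import torch
import torch.nn as nn

class OffsetNet(nn.Module):
    """Toy CNN mapping a camera view of the QR code and projected pattern
    to a 6-DoF extrinsic correction (3 rotation + 3 translation)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 6)  # (rx, ry, rz, tx, ty, tz)

    def forward(self, img):
        return self.head(self.features(img).flatten(1))

def correct_extrinsics(extrinsics, capture_fn, model, n_iters=10, step=0.5):
    """Iteratively refine the extrinsic offset; capture_fn stands in for
    projecting the pattern and grabbing a camera frame."""
    with torch.no_grad():
        for _ in range(n_iters):
            img = capture_fn(extrinsics)            # (1, 3, H, W)
            delta = model(img).squeeze(0)           # predicted residual offset
            extrinsics = extrinsics - step * delta  # damped iterative update
    return extrinsics

model = OffsetNet().eval()
capture = lambda ext: torch.rand(1, 3, 128, 128)    # stand-in camera loop
print(correct_extrinsics(torch.zeros(6), capture, model))
```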
Language-conditioned policies allow robots to interpret and execute human instructions. Learning such policies requires a substantial investment of time and compute resources, and yet the resulting controllers are highly device-specific and cannot easily be transferred to a robot with a different morphology, capability, appearance, or dynamics. In this paper, we propose a sample-efficient approach for training language-conditioned manipulation policies that allows for rapid transfer across different types of robots. By introducing a novel method, Hierarchical Modularity, and adopting supervised attention across multiple sub-modules, we bridge the divide between modular and end-to-end learning and enable the reuse of functional building blocks. In both simulated and real-world robot manipulation experiments, we demonstrate that our method outperforms the current state-of-the-art methods and can transfer policies across four different robots in a sample-efficient manner. Finally, we show that the functionality of the learned sub-modules is maintained beyond the training process and can be used to introspect the robot's decision-making process. Code is available at https://github.com/ir-lab/ModAttn.
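To make the supervised-attention idea concrete, here is a toy sketch with stand-in dimensions and module names (the actual hierarchy and losses are in the ModAttn repository linked above): a language-conditioned controller weights several functional sub-modules, and an auxiliary cross-entropy loss supervises which sub-module should be active.

```python
# Toy sketch of supervised attention over functional sub-modules
# (hypothetical layer names; see the ModAttn repo for the real design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularPolicy(nn.Module):
    def __init__(self, obs_dim=32, lang_dim=16, hid=64, n_modules=3, act_dim=7):
        super().__init__()
        # Each sub-module specializes in one skill (e.g., locate, reach, grasp).
        self.modules_ = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hid), nn.ReLU())
            for _ in range(n_modules)
        )
        self.attn = nn.Linear(lang_dim, n_modules)  # language -> module weights
        self.head = nn.Linear(hid, act_dim)

    def forward(self, obs, lang):
        logits = self.attn(lang)                                 # (B, M)
        w = F.softmax(logits, dim=-1)
        feats = torch.stack([m(obs) for m in self.modules_], 1)  # (B, M, hid)
        fused = (w.unsqueeze(-1) * feats).sum(1)
        return self.head(fused), logits

policy = ModularPolicy()
obs, lang = torch.randn(8, 32), torch.randn(8, 16)
action_target = torch.randn(8, 7)
module_target = torch.randint(0, 3, (8,))  # which module *should* be active

action, attn_logits = policy(obs, lang)
loss = F.mse_loss(action, action_target) \
     + F.cross_entropy(attn_logits, module_target)  # supervised attention
loss.backward()
```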
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme: first generate class-agnostic region proposals, then feed the cropped proposal regions to CLIP to exploit its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level. Our investigation starts from a straightforward baseline that generates semantic masks by comparing the similarity between text embeddings and the patch embeddings extracted from CLIP. However, this paradigm heavily overfits the seen classes and fails to generalize to unseen classes. To handle this issue, we propose three simple but effective designs and show that they substantially preserve the inherent zero-shot capacity of CLIP while improving pixel-level generalization. Incorporating these modifications yields an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves roughly a 5x speedup during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
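The baseline extension is easy to state in code: compare CLIP patch embeddings against text embeddings and read off a per-patch class. The sketch below uses random tensors in place of the actual CLIP encoders, so the shapes and similarity computation are the only faithful parts.

```python
# Sketch of the one-stage baseline: per-patch similarity against text
# embeddings yields class logits per patch, i.e., a coarse semantic mask.
# Random tensors stand in for CLIP's visual patch tokens and text features.
import torch
import torch.nn.functional as F

B, P, D, C = 1, 14 * 14, 512, 5     # batch, patches (14x14 grid), dim, classes
patch_emb = F.normalize(torch.randn(B, P, D), dim=-1)  # from CLIP image encoder
text_emb = F.normalize(torch.randn(C, D), dim=-1)      # from CLIP text encoder

logits = patch_emb @ text_emb.t()          # (B, P, C) cosine similarities
mask = logits.argmax(-1).view(B, 14, 14)   # per-patch class prediction
mask = F.interpolate(                      # upsample to pixel resolution
    F.one_hot(mask, C).permute(0, 3, 1, 2).float(),
    size=(224, 224), mode="nearest",
).argmax(1)
print(mask.shape)  # torch.Size([1, 224, 224])
```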
The high-dimensional linear regression model is the most widely used statistical model for high-dimensional data, yet obtaining a sparse set of regression coefficients remains a challenging task. In this paper, we propose a simple heuristic algorithm, adapted from the shortest-solution-guided decimation algorithm and referred to as ASSD, for constructing sparse high-dimensional linear regression models. The algorithm builds the support of the regression coefficients under the guidance of the least-squares solution of the recursively decimated linear equations, and applies an early-stopping criterion and a second-stage thresholding procedure to refine this support. Our extensive numerical results demonstrate that ASSD outperforms LASSO, vector approximate message passing, and two other representative greedy algorithms in solution accuracy and robustness. ASSD is especially suitable for linear regression problems with the highly correlated measurement matrices encountered in real-world applications.
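A simplified decimation loop in the spirit of ASSD (not the authors' exact procedure): repeatedly solve least squares on the active set, drop the weakest coefficients, stop early once the residual jumps, and finish with a hard-thresholding pass. The residual budget `eps` and the drop fraction are illustrative choices.

```python
import numpy as np

def assd_like(X, y, drop_frac=0.1, eps=None, thresh=0.1):
    # Residual budget for early stopping (illustrative choice).
    eps = eps if eps is not None else 0.05 * np.linalg.norm(y)
    active = np.arange(X.shape[1])
    prev = active
    while len(active) > 1:
        # Min-norm least-squares ("shortest") solution on the decimated system.
        beta_a, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        res = np.linalg.norm(X[:, active] @ beta_a - y)
        if res > eps:                        # early stopping: support too small
            active = prev                    # revert the last decimation step
            break
        prev = active
        n_drop = max(1, int(drop_frac * len(active)))
        order = np.argsort(np.abs(beta_a))   # decimate the weakest coefficients
        active = active[order[n_drop:]]
    beta = np.zeros(X.shape[1])
    beta_a, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    beta[active] = beta_a * (np.abs(beta_a) > thresh)  # second-stage thresholding
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
true = np.zeros(200); true[:5] = 3.0
y = X @ true + 0.01 * rng.normal(size=100)
print(np.nonzero(assd_like(X, y))[0])        # ideally recovers indices 0..4
```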
The power of Deep Neural Networks (DNNs) depends heavily on the quantity, quality, and diversity of the training data. In many real scenarios, however, collecting and annotating large-scale data is costly and time-consuming, which has severely hindered the application of DNNs. To address this challenge, we explore a new task of dataset expansion, which seeks to automatically create new labeled samples that expand a small dataset. To this end, we present a Guided Imagination Framework (GIF) that leverages recently developed big generative models (e.g., DALL-E2) and reconstruction models (e.g., MAE) to "imagine" and create informative new data from seed data. Specifically, GIF conducts imagination by optimizing the latent features of seed data in a semantically meaningful space; the optimized features are fed into the generative models to produce photo-realistic images with new content. To guide the imagination towards samples that are useful for model training, we exploit the zero-shot recognition ability of CLIP and introduce three criteria that encourage informative sample generation: prediction consistency, entropy maximization, and diversity promotion. With these criteria as guidance, GIF works well for expanding datasets in different domains, leading to a 29.9% accuracy gain on average over six natural image datasets and a 12.3% accuracy gain on average over three medical image datasets. The source code will be released at https://github.com/Vanint/DatasetExpansion.
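A toy version of the guided-imagination step, with stand-in linear networks in place of the real generative and recognition models (all names here are hypothetical): a perturbed seed latent is optimized under the three criteria, class consistency with the seed, prediction-entropy maximization, and pairwise diversity.

```python
# Toy sketch of GIF-style guided latent optimization (stand-in networks;
# the real method optimizes latents of large generative models).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
decoder = torch.nn.Linear(64, 3 * 32 * 32)     # stand-in generative model
classifier = torch.nn.Linear(3 * 32 * 32, 10)  # stand-in zero-shot classifier

z_seed = torch.randn(4, 64)                    # latents of the seed images
z = (z_seed + 0.1 * torch.randn_like(z_seed)).requires_grad_()
seed_cls = classifier(decoder(z_seed)).argmax(-1)

opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(100):
    logits = classifier(decoder(z))
    p = logits.softmax(-1)
    consistency = F.cross_entropy(logits, seed_cls)          # keep seed class
    entropy = -(p * p.clamp_min(1e-8).log()).sum(-1).mean()  # informativeness
    sim = F.cosine_similarity(z.unsqueeze(0), z.unsqueeze(1), dim=-1)
    diversity = sim.triu(1).mean()             # penalize similar latents
    loss = consistency - 0.1 * entropy + 0.1 * diversity
    opt.zero_grad(); loss.backward(); opt.step()
print(z.detach().norm(dim=-1))                 # optimized "imagined" latents
```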
Online forms are widely used to collect data from humans and represent a multi-billion-dollar market. Many software products provide online services for creating semi-structured forms in which questions and descriptions are organized by pre-defined structures. However, designing and creating forms remains tedious and requires expert knowledge. To assist form designers, in this work we present FormLM, which models online forms by enhancing a pre-trained language model with form structural information and recommends form-creation ideas, including question and options recommendations and block-type suggestions. For model training and evaluation, we collect the first public online form dataset, comprising 62K online forms. Experimental results show that FormLM significantly outperforms general-purpose language models on all tasks, improving Question Recommendation by 4.71 in ROUGE-1 and Block Type Suggestion by 10.6 in Macro-F1.
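One plausible reading of "enhancing a pre-trained language model with form structural information" is to add learned block-type embeddings to the token embeddings before encoding; the sketch below is that guess in miniature, not FormLM's actual architecture.

```python
# Minimal sketch: enrich token embeddings with a learned form-structure
# (block-type) embedding before the encoder. A hypothetical simplification,
# not the published FormLM design.
import torch
import torch.nn as nn

class StructureAwareEncoder(nn.Module):
    def __init__(self, vocab=30522, d=256, n_block_types=6):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.block = nn.Embedding(n_block_types, d)  # e.g., title/question/option
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, block_type_ids):
        return self.enc(self.tok(token_ids) + self.block(block_type_ids))

m = StructureAwareEncoder()
tokens = torch.randint(0, 30522, (2, 12))
blocks = torch.randint(0, 6, (2, 12))   # per-token block-type labels
print(m(tokens, blocks).shape)          # torch.Size([2, 12, 256])
```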
Large language models (LLMs) have been shown to perform new tasks from a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently withheld from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built through a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
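The released checkpoints are straightforward to try via Hugging Face transformers; the snippet below loads the small 560M-parameter sibling checkpoint to keep the download manageable.

```python
# Load a small BLOOM checkpoint and generate a completion (requires the
# `transformers` package; bigscience/bloom-560m is one of the released sizes).
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tok("The ROOTS corpus covers 46 natural languages and",
             return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```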
Figures of speech, such as metaphor and sarcasm, are ubiquitous in literary works and colloquial conversation. They pose a great challenge for natural language understanding, since figures of speech usually deviate from their surface meaning to convey deeper semantic implications. Previous studies have emphasized the literary aspects of figures of speech and seldom provide a comprehensive exploration from the viewpoint of computational linguistics. In this paper, we first propose the concept of the figurative unit, the carrier of a figure of speech. We then select 12 types of figures commonly used in Chinese and construct a Chinese corpus for Contextualized Figure Recognition (ConFiGure). Unlike previous token-level or sentence-level counterparts, ConFiGure aims to extract figurative units from discourse-level context and to classify each figurative unit into the correct figure type. On ConFiGure, three tasks are designed, namely figure extraction, figure-type classification, and figure recognition, and state-of-the-art techniques are employed to establish benchmarks. We conduct thorough experiments and show that all three tasks are challenging for existing models, calling for further research. Our dataset and code are publicly available at https://github.com/pku-tangent/configure.
In this paper, we revisit the long-standing problem of automatically reconstructing a 3D object from a single line drawing. Previous optimization-based methods can generate compact and accurate 3D models, but their success rate depends heavily on (i) the ability to identify a set of true geometric constraints, and (ii) the choice of a good initial value for the numerical optimization. In view of these challenges, we propose to train deep neural networks to detect pairwise relationships between geometric entities (i.e., edges) in the 3D object and to predict initial depth values for the vertices. Our experiments on a large dataset of CAD models show that, by leveraging deep learning in a geometric-constraint-solving pipeline, the success rate of optimization-based 3D reconstruction can be significantly improved.
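Schematically, the hybrid pipeline reduces to: network outputs (pairwise edge relations and initial vertex depths) feed a numerical optimizer. The toy below encodes only one constraint type (parallelism) and uses the predicted depths as the initial value; all inputs are stand-ins, and the real constraint set is far richer.

```python
# Schematic of the hybrid pipeline: network-predicted vertex depths serve as
# the initial value for constraint optimization (toy parallelism residuals on
# edge pairs that a network flagged as parallel; all inputs are stand-ins).
import numpy as np
from scipy.optimize import least_squares

xy = np.array([[0, 0], [1, 0], [1, 1], [0, 1.0]])   # 2D drawing vertices
edges = [(0, 1), (3, 2), (0, 3), (1, 2)]             # line-drawing edges
parallel_pairs = [(0, 1), (2, 3)]                    # predicted by the network
z_init = np.array([0.0, 0.1, 0.4, 0.5])              # predicted initial depths

def residuals(z):
    p = np.column_stack([xy, z])                     # lift vertices to 3D
    d = [p[j] - p[i] for i, j in edges]              # edge direction vectors
    # Parallel edges must have a zero cross product.
    return np.concatenate([np.cross(d[a], d[b]) for a, b in parallel_pairs])

sol = least_squares(residuals, z_init)   # a good init aids reliable convergence
print(sol.x, np.abs(residuals(sol.x)).max())
```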
We devise Neural Dynamic State Estimation (Neuro-DSE), a learning-based dynamic state estimation (DSE) algorithm for networked microgrids (NMs) with unknown subsystems. Our contributions include: 1) a data-driven Neuro-DSE algorithm for DSE of NMs with partially unidentified dynamic models, which integrates neural ordinary differential equations (ODE-Net) into a Kalman filter; 2) a self-refining Neuro-DSE algorithm (Neuro-DSE+), which enables data-driven DSE under limited and noisy measurements through an automatic filtering, augmentation, and correction framework; 3) a Neuro-KalmanNet-DSE algorithm, which further integrates KalmanNet with Neuro-DSE to mitigate the model mismatch between neural and physics-based dynamic models; and 4) an augmented Neuro-DSE for the joint estimation of NM states and unknown parameters (e.g., inertia). Extensive case studies demonstrate the efficacy of Neuro-DSE and its variants under different noise levels, control modes, power sources, observability conditions, and levels of model knowledge.
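A condensed sketch of contribution 1): an ODE-Net supplies the state-transition function inside an extended Kalman filter, with the transition Jacobian obtained by autograd. The dimensions, noise levels, and single-step Euler integrator are illustrative assumptions.

```python
# Condensed sketch of fusing an ODE-Net into a Kalman filter: the learned
# dynamics replace the physics model in the predict step, and the transition
# Jacobian comes from autograd. Dimensions and names are illustrative.
import torch

torch.manual_seed(0)
nx, nz, dt = 4, 2, 0.01
f = torch.nn.Sequential(torch.nn.Linear(nx, 32), torch.nn.Tanh(),
                        torch.nn.Linear(32, nx))   # ODE-Net: dx/dt = f(x)
H = torch.eye(nz, nx)                              # linear measurement model
Q, R = 1e-4 * torch.eye(nx), 1e-2 * torch.eye(nz)  # process/measurement noise

def ekf_step(x, P, z):
    # Predict: one Euler step of the neural ODE; Jacobian via autograd.
    step = lambda s: s + dt * f(s)
    F = torch.autograd.functional.jacobian(step, x)
    x_pred = step(x).detach()
    P_pred = F @ P @ F.T + Q
    # Update: standard Kalman correction with measurement z.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ torch.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (torch.eye(nx) - K @ H) @ P_pred
    return x_new, P_new

x, P = torch.zeros(nx), torch.eye(nx)
for _ in range(5):
    z = torch.randn(nz)                            # stand-in measurement
    x, P = ekf_step(x, P, z)
print(x)
```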